{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes - Trabalho\n",
    "\n",
    "## Questão 1\n",
    "\n",
    "Implemente um classifacor Naive Bayes para o problema de predizer a qualidade de um carro. Para este fim, utilizaremos um conjunto de dados referente a qualidade de carros, disponível no [UCI](https://archive.ics.uci.edu/ml/datasets/car+evaluation). Este dataset de carros possui as seguintes features e classe:\n",
    "\n",
    "** Attributos **\n",
    "1. buying: vhigh, high, med, low\n",
    "2. maint: vhigh, high, med, low\n",
    "3. doors: 2, 3, 4, 5, more\n",
    "4. persons: 2, 4, more\n",
    "5. lug_boot: small, med, big\n",
    "6. safety: low, med, high\n",
    "\n",
    "** Classes **\n",
    "1. unacc, acc, good, vgood\n",
    "\n",
    "## Questão 2\n",
    "Crie uma versão de sua implementação usando as funções disponíveis na biblioteca SciKitLearn para o Naive Bayes ([veja aqui](http://scikit-learn.org/stable/modules/naive_bayes.html)) \n",
    "\n",
    "## Questão 3\n",
    "\n",
    "Analise a acurácia dos dois algoritmos e discuta a sua solução."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aluno: Marcos Felipe de Menezes Mota - 354080"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import random\n",
    "import math\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import classification_report, accuracy_score\n",
    "import random\n",
    "from sklearn.naive_bayes import GaussianNB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def splitDataset(dataset, splitRatio):\n",
    "    trainSize = int(len(dataset) * splitRatio)\n",
    "    trainSet = []\n",
    "    copy = list(dataset)\n",
    "    while len(trainSet) < trainSize:\n",
    "        index = random.randrange(len(copy))\n",
    "        trainSet.append(copy.pop(index))\n",
    "    return [trainSet, copy]\n",
    "\n",
    "def separateByClass(dataset):\n",
    "    separated = {}\n",
    "    for i in range(len(dataset)):\n",
    "        vector = dataset[i]\n",
    "        if (vector[-1] not in separated):\n",
    "            separated[vector[-1]] = []\n",
    "        separated[vector[-1]].append(vector)\n",
    "    return separated\n",
    "\n",
    "\n",
    "def mean(numbers):\n",
    "    return sum(numbers)/float(len(numbers))\n",
    " \n",
    "def stdev(numbers):\n",
    "    avg = mean(numbers)\n",
    "    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)\n",
    "    return math.sqrt(variance)\n",
    "\n",
    "def summarize(dataset):\n",
    "    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]\n",
    "    del summaries[-1]\n",
    "    return summaries\n",
    "\n",
    "def summarizeByClass(dataset):\n",
    "    separated = separateByClass(dataset)\n",
    "    summaries = {}\n",
    "    for classValue, instances in separated.items():\n",
    "        summaries[classValue] = summarize(instances)\n",
    "    return summaries\n",
    "\n",
    "def calculateProbability(x, mean, stdev):\n",
    "    if (stdev == 0.0):\n",
    "        exponent = 0.0\n",
    "        retorno = 0.0\n",
    "    else:\n",
    "        exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))\n",
    "        retorno = (1 / (math.sqrt(2*math.pi) * stdev)) * exponent\n",
    "    \n",
    "    return retorno\n",
    "\n",
    "def calculateClassProbabilities(summaries, inputVector):\n",
    "    probabilities = {}\n",
    "    for classValue, classSummaries in summaries.items():\n",
    "        probabilities[classValue] = 1\n",
    "        for i in range(len(classSummaries)):\n",
    "            mean, stdev = classSummaries[i]\n",
    "            x = inputVector[i]\n",
    "            probabilities[classValue] *= calculateProbability(x, mean, stdev)\n",
    "    return probabilities\n",
    "\n",
    "def predict(summaries, inputVector):\n",
    "    probabilities = calculateClassProbabilities(summaries, inputVector)\n",
    "    bestLabel, bestProb = None, -1\n",
    "    for classValue, probability in probabilities.items():\n",
    "        if bestLabel is None or probability > bestProb:\n",
    "            bestProb = probability\n",
    "            bestLabel = classValue\n",
    "    return bestLabel\n",
    "\n",
    "def getPredictions(summaries, testSet):\n",
    "    predictions = []\n",
    "    for i in range(len(testSet)):\n",
    "        result = predict(summaries, testSet[i])\n",
    "        predictions.append(result)\n",
    "    return predictions\n",
    "\n",
    "def getAccuracy(testSet, predictions):\n",
    "    correct = 0\n",
    "    for i in range(len(testSet)):\n",
    "        if testSet[i][-1] == predictions[i]:\n",
    "            correct += 1\n",
    "    return (correct/float(len(testSet))) * 100.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>small</td>\n",
       "      <td>low</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>small</td>\n",
       "      <td>med</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>small</td>\n",
       "      <td>high</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>med</td>\n",
       "      <td>low</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>vhigh</td>\n",
       "      <td>vhigh</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>med</td>\n",
       "      <td>med</td>\n",
       "      <td>unacc</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       0      1  2  3      4     5      6\n",
       "0  vhigh  vhigh  2  2  small   low  unacc\n",
       "1  vhigh  vhigh  2  2  small   med  unacc\n",
       "2  vhigh  vhigh  2  2  small  high  unacc\n",
       "3  vhigh  vhigh  2  2    med   low  unacc\n",
       "4  vhigh  vhigh  2  2    med   med  unacc"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset = pd.read_csv(\"carData.csv\", header=None)\n",
    "dataset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0  1  2  3  4  5  6\n",
       "0  3  3  0  0  2  1  2\n",
       "1  3  3  0  0  2  2  2\n",
       "2  3  3  0  0  2  0  2\n",
       "3  3  3  0  0  1  1  2\n",
       "4  3  3  0  0  1  2  2"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print(dataset.shape)\n",
    "for i in range(7):\n",
    "    dataset[i] = dataset[i].astype('category')\n",
    "    dataset[i] = dataset[i].cat.codes\n",
    "dataset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[3, 3, 0, ..., 2, 1, 2],\n",
       "       [3, 3, 0, ..., 2, 2, 2],\n",
       "       [3, 3, 0, ..., 2, 0, 2],\n",
       "       ..., \n",
       "       [1, 1, 3, ..., 0, 1, 2],\n",
       "       [1, 1, 3, ..., 0, 2, 1],\n",
       "       [1, 1, 3, ..., 0, 0, 3]], dtype=int8)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset = dataset.values\n",
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dataset = dataset.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "O valor da acuracia é: 67.5925925925926\n"
     ]
    }
   ],
   "source": [
    "train, test = splitDataset(dataset, 0.75)\n",
    "summaries = summarizeByClass(train)\n",
    "predictions = getPredictions(summaries, test)\n",
    "accuracy = getAccuracy(test, predictions)\n",
    "\n",
    "print(\"O valor da acuracia é:\",accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Utilizando Nayve Bayes do Skitlearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of mislabeled points out of a total 1728 points : 644\n"
     ]
    }
   ],
   "source": [
    "dataset = pd.read_csv(\"carData.csv\", header=None)\n",
    "for i in range(7):\n",
    "    dataset[i] = dataset[i].astype('category')\n",
    "    dataset[i] = dataset[i].cat.codes\n",
    "y = dataset[6]\n",
    "del dataset[6]\n",
    "\n",
    "gnb = GaussianNB()\n",
    "X = dataset.values\n",
    "y_pred = gnb.fit(X, y).predict(X)\n",
    "print(\"Number of mislabeled points out of a total %d points : %d\" % (X.shape[0],(y != y_pred).sum()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "O valor da acuracia é: 0.627314814815\n"
     ]
    }
   ],
   "source": [
    "accuracy = accuracy_score(y, y_pred)\n",
    "print(\"O valor da acuracia é:\",accuracy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
